25 Model assumptions and reporting results

25.1 Learning goals

  • Review model assumptions.
  • Explore how to test for model assumptions.
  • Learn what to do if model assumptions aren’t met.
  • Learn how to report statistical results.

25.3 Model assumptions

25.3.1 Linear regression

Linear regression assumes that the outcome is a linear function of the predictors, and that the residuals are independent, normally distributed, and have constant variance (homoscedasticity).

25.3.2 Logistic regression

Logistic regression drops the normality and constant-variance assumptions on the residuals, but still assumes that the observations are independent and that the log-odds of the outcome are a linear function of the predictors.

25.4 Regression diagnostics

“Regression diagnostics are methods for determining whether a fitted regression model adequately represents the data.” (p. 385) (Fox and Weisberg 2018)

25.4.1 Residual plots

Let’s illustrate via the credit card balance data set.

Let’s fit a linear regression and visualize the residuals.
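The model call shown in the output below can be reproduced as follows (the variable names are taken from the model call in the output; storing the fit as `fit.credit` is an assumption):

```r
# fit a linear regression predicting credit card balance
# from income and marital status
fit.credit = lm(formula = balance ~ 1 + income + married,
                data = df.credit)

summary(fit.credit)
```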

#> 
#> Call:
#> lm(formula = balance ~ 1 + income + married, data = df.credit)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -816.96 -350.50  -50.68  331.67 1087.53 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 258.8819    41.4313   6.248 1.07e-09 ***
#> income        6.0587     0.5803  10.441  < 2e-16 ***
#> marriedYes  -20.9546    41.9260  -0.500    0.617    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 408.2 on 397 degrees of freedom
#> Multiple R-squared:  0.2155, Adjusted R-squared:  0.2115 
#> F-statistic: 54.52 on 2 and 397 DF,  p-value: < 2.2e-16

income is a significant predictor of balance, but whether or not a person is married is not.

Let’s look at the residuals:
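One way to make a residual plot is via `broom::augment()`; a sketch, assuming the model fit from above is stored as `fit.credit`:

```r
# plot residuals against fitted values
fit.credit %>% 
  augment() %>% 
  ggplot(mapping = aes(x = .fitted, y = .resid)) + 
  geom_hline(yintercept = 0, linetype = "dashed") + 
  geom_point(alpha = 0.3) + 
  labs(x = "fitted values", y = "residuals")
```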

This plot helps assess whether there is homogeneity of variance. Overall, the residual plot looks pretty okay. The diagonal line of points in the bottom left of the plot arises because credit card balance cannot go below 0, and a number of people in the data set have a credit card balance of exactly 0.

We can also check whether the residuals are normally distributed by plotting the density of the residuals together with a quantile-quantile (QQ) plot.
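A sketch of both plots, again assuming the fit is stored as `fit.credit` (the `patchwork` package, which is among the loaded packages, combines the two plots):

```r
df.residuals = fit.credit %>% 
  augment()

# density of the residuals
p1 = ggplot(data = df.residuals,
            mapping = aes(x = .resid)) + 
  geom_density()

# quantile-quantile plot of the residuals
p2 = ggplot(data = df.residuals,
            mapping = aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line()

p1 + p2
```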

The residuals aren’t really normally distributed. As both the density and the QQ plot show, residuals with low/negative values are more frequent than residuals with high/positive values.

We can also inspect the standard diagnostic plots by calling plot() on the model fit.
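For example (with `fit.credit` as the assumed name of the model fit):

```r
# base R diagnostic plots: residuals vs. fitted, QQ plot,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit.credit)
par(mfrow = c(1, 1))
```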

25.4.2 Influential data points

Because linear regression models are fitted by minimizing the squared error between prediction and data, the results can be strongly influenced by outliers. There are a number of ways of checking for outliers.

25.4.2.1 Leverage: Hat-values

Data points that are far from the center of the predictor space have potentially greater influence on the results – these points have high leverage.

Hat-values are a way of characterizing how much leverage individual data points have. Hat-values are bounded between 1/n and 1, and the sum of all hat-values equals the number of coefficients in the model (including the intercept). If a data point has high influence, the results would change considerably if that data point were removed.
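Hat-values can be extracted with `hatvalues()`; a sketch, with assumed object names:

```r
# hat-values for each case, sorted from largest to smallest
fit.credit %>% 
  hatvalues() %>% 
  enframe(name = "case", value = "hat_value") %>% 
  arrange(desc(hat_value))
```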

Based on this result, it looks like case 324 has a high influence on the predictions. Let’s see what the results look like with and without that data point.

#> [1] "hat1: 0.042"
#> [1] "hat2: 0.042"

25.4.2.2 Cook’s distance

Cook’s distance combines a point’s leverage with the size of its residual into a single measure of influence. It is defined as

\[ D_i = \frac{e^2_{Si}}{k + 1} \times \frac{h_i}{1-h_i}, \]

where \(e^2_{Si}\) is the squared standardized residual, \(k\) is the number of coefficients in the model (excluding the intercept), and \(h_i\) is the hat-value for case \(i\).

Let’s double check here:
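A sketch of the check: compute Cook’s distance by hand from the formula above and compare it against the `.cooksd` column that `broom::augment()` provides (the model has `k = 2` predictors; object names are assumptions):

```r
k = 2  # number of predictors (excluding the intercept)

fit.credit %>% 
  augment() %>% 
  mutate(cook = .std.resid^2 / (k + 1) * .hat / (1 - .hat)) %>% 
  select(.cooksd, cook) %>% 
  head(10)
```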

#> # A tibble: 10 x 2
#>        .cooksd        cook
#>          <dbl>       <dbl>
#>  1 0.000000289 0.000000289
#>  2 0.0000119   0.0000119  
#>  3 0.00281     0.00281    
#>  4 0.00238     0.00238    
#>  5 0.000519    0.000519   
#>  6 0.00308     0.00308    
#>  7 0.000510    0.000510   
#>  8 0.000530    0.000530   
#>  9 0.0000842   0.0000842  
#> 10 0.00500     0.00500

Looking good!

25.5 What if assumptions are violated?

25.5.1 Transforming the outcome variable

25.5.1.1 Logarithmic transform
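The four model fits below (raw and log-transformed versions of the outcome and the predictor) can be reproduced like this; `df.un` holds the UN data, with variable names taken from the model calls in the output (the fit names are assumptions):

```r
# infant mortality as a function of per-person GDP,
# with and without log transforms
fit.raw    = lm(infant_mortality ~ ppgdp, data = df.un)
fit.loglog = lm(log(infant_mortality) ~ log(ppgdp), data = df.un)
fit.logy   = lm(log(infant_mortality) ~ ppgdp, data = df.un)
fit.logx   = lm(infant_mortality ~ log(ppgdp), data = df.un)

summary(fit.raw)
summary(fit.loglog)
summary(fit.logy)
summary(fit.logx)
```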

#> 
#> Call:
#> lm(formula = infant_mortality ~ ppgdp, data = df.un)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -31.48 -18.65  -8.59  10.86  83.59 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 41.3780016  2.2157454  18.675  < 2e-16 ***
#> ppgdp       -0.0008656  0.0001041  -8.312 1.73e-14 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 25.13 on 191 degrees of freedom
#> Multiple R-squared:  0.2656, Adjusted R-squared:  0.2618 
#> F-statistic: 69.08 on 1 and 191 DF,  p-value: 1.73e-14
#> 
#> Call:
#> lm(formula = log(infant_mortality) ~ log(ppgdp), data = df.un)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.16789 -0.36738 -0.02351  0.24544  2.43503 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  8.10377    0.21087   38.43   <2e-16 ***
#> log(ppgdp)  -0.61680    0.02465  -25.02   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5281 on 191 degrees of freedom
#> Multiple R-squared:  0.7662, Adjusted R-squared:  0.765 
#> F-statistic: 625.9 on 1 and 191 DF,  p-value: < 2.2e-16
#> 
#> Call:
#> lm(formula = log(infant_mortality) ~ ppgdp, data = df.un)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.61611 -0.48094 -0.07858  0.53930  2.17745 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.479e+00  6.537e-02   53.23   <2e-16 ***
#> ppgdp       -4.595e-05  3.072e-06  -14.96   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7413 on 191 degrees of freedom
#> Multiple R-squared:  0.5394, Adjusted R-squared:  0.537 
#> F-statistic: 223.7 on 1 and 191 DF,  p-value: < 2.2e-16
#> 
#> Call:
#> lm(formula = infant_mortality ~ log(ppgdp), data = df.un)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -38.239 -11.609  -2.829   8.122  82.183 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 155.7698     7.2431   21.51   <2e-16 ***
#> log(ppgdp)  -14.8617     0.8468  -17.55   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 18.14 on 191 degrees of freedom
#> Multiple R-squared:  0.6172, Adjusted R-squared:  0.6152 
#> F-statistic:   308 on 1 and 191 DF,  p-value: < 2.2e-16
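The transformation diagnostics below appear to come from `car::powerTransform()`, which estimates the Box-Cox transformation parameter for the outcome; a sketch:

```r
# estimate the Box-Cox power transformation for the outcome
fit.power = powerTransform(infant_mortality ~ 1, data = df.un)
summary(fit.power)
```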

#> bcPower Transformation to Normality 
#>    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
#> Y1    0.0945           0      -0.0435       0.2326
#> 
#> Likelihood ratio test that transformation parameter is equal to 0
#>  (log transformation)
#>                            LRT df    pval
#> LR test, lambda = (0) 1.809774  1 0.17854
#> 
#> Likelihood ratio test that no transformation is needed
#>                            LRT df       pval
#> LR test, lambda = (1) 146.7235  1 < 2.22e-16

25.5.2 Non-parametric tests

25.5.2.1 Bootstrapping regressions

This section is based on this blog post.

25.5.2.1.1 Visualize the data
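A sketch of the plot (the data frame `df.turkey` and its variables `turkey_time` and `nap_time` are taken from the model calls below):

```r
ggplot(data = df.turkey,
       mapping = aes(x = turkey_time, y = nap_time)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm")
```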
#> `geom_smooth()` using formula 'y ~ x'

A simple linear regression doesn’t fit the data well (not surprising, since we included a squared predictor).

25.5.2.1.2 Fit linear models
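The fit and the tidy coefficient table with confidence intervals shown below can be reproduced as follows (the object name `fit.turkey` is an assumption):

```r
fit.turkey = lm(formula = nap_time ~ 1 + turkey_time,
                data = df.turkey)

summary(fit.turkey)

# tidy coefficient table with confidence intervals
fit.turkey %>% 
  tidy(conf.int = TRUE)
```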
#> 
#> Call:
#> lm(formula = nap_time ~ 1 + turkey_time, data = df.turkey)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -212.82 -146.78  -55.17  125.74  462.52 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  15.4974    23.3827   0.663    0.508    
#> turkey_time  51.5746     0.8115  63.557   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 172.4 on 248 degrees of freedom
#> Multiple R-squared:  0.9422, Adjusted R-squared:  0.9419 
#> F-statistic:  4039 on 1 and 248 DF,  p-value: < 2.2e-16
#> # A tibble: 2 x 7
#>   term        estimate std.error statistic   p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 (Intercept)     15.5    23.4       0.663 5.08e-  1    -30.6      61.6
#> 2 turkey_time     51.6     0.811    63.6   1.72e-155     50.0      53.2

As the plot above shows, a simple linear model doesn’t predict the data well.

A regression with a squared predictor would fit well:
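A sketch of that fit (object name assumed):

```r
fit.turkey2 = lm(formula = nap_time ~ 1 + I(turkey_time^2),
                 data = df.turkey)

summary(fit.turkey2)
```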

#> 
#> Call:
#> lm(formula = nap_time ~ 1 + I(turkey_time^2), data = df.turkey)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -45.611  -9.911  -0.652  11.137  43.008 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)      4.994e+02  1.575e+00   317.0   <2e-16 ***
#> I(turkey_time^2) 1.001e+00  1.439e-03   695.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 16.23 on 248 degrees of freedom
#> Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
#> F-statistic: 4.835e+05 on 1 and 248 DF,  p-value: < 2.2e-16

25.5.2.1.3 Fit a bootstrap regression
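The output below appears to come from `car::Boot()`; a sketch, assuming the simple linear fit from above is stored as `fit.turkey`:

```r
# case-resampling bootstrap of the regression coefficients
boot.turkey = Boot(fit.turkey, R = 999)
summary(boot.turkey)

# bias-corrected-and-accelerated and percentile confidence intervals
confint(boot.turkey, type = "bca")
confint(boot.turkey, type = "perc")
```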
#> 
#> Number of bootstrap replications R = 999 
#>             original  bootBias bootSE bootMed
#> (Intercept)   15.497 -1.130589 29.330  15.080
#> turkey_time   51.575  0.023717  1.058  51.625
#> Bootstrap bca confidence intervals
#> 
#>                 2.5 %   97.5 %
#> (Intercept) -41.94541 75.90403
#> turkey_time  49.26083 53.45920
#> Bootstrap percent confidence intervals
#> 
#>                 2.5 %   97.5 %
#> (Intercept) -47.63914 71.63622
#> turkey_time  49.35224 53.60821

We see that the confidence intervals from the bootstrap are wider than those from the normal-theory linear regression model.

25.6 Reporting results

25.6.1 Regression analysis

25.6.1.1 Tables

To export regression results to LaTeX, I recommend the xtable and stargazer packages. The xtable package is nice and simple, whereas the stargazer package is more involved and allows for a lot of customization.

To show nice regression tables in RMarkdown, I recommend the sjPlot package.
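The table below can be produced with `sjPlot::tab_model()`; a sketch, with the model reconstructed from the predictors shown in the table (object and variable names are assumptions):

```r
fit.un = lm(log(infant_mortality) ~ log(ppgdp) + group,
            data = df.un)

fit.un %>% 
  sjPlot::tab_model()
```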

                      log(infant mortality)
  Predictors          Estimates   CI              Statistic   p
  (Intercept)         6.26        5.61 – 6.90     19.25       <0.001
  ppgdp [log]         -0.46       -0.52 – -0.40   -15.28      <0.001
  group [other]       0.48        0.26 – 0.70     4.36        <0.001
  group [africa]      1.05        0.76 – 1.34     7.15        <0.001
  Observations        193
  R2 / R2 adjusted    0.819 / 0.816

25.7 Additional resources

25.8 Session info

Information about this R session, including which version of R was used and what packages were loaded.

#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.0    stringr_1.4.0    dplyr_0.8.4      purrr_0.3.3     
#>  [5] readr_1.3.1      tidyr_1.0.2      tibble_2.1.3     ggplot2_3.3.0   
#>  [9] tidyverse_1.3.0  xtable_1.8-4     sjPlot_2.8.2     stargazer_5.2.2 
#> [13] ggeffects_0.14.1 patchwork_1.0.0  janitor_1.2.1    broom_0.5.5     
#> [17] car_3.0-6        carData_3.0-3    brms_2.12.0      Rcpp_1.0.3      
#> [21] lme4_1.1-21      Matrix_1.2-18    tidybayes_2.0.1  kableExtra_1.1.0
#> [25] knitr_1.28      
#> 
#> loaded via a namespace (and not attached):
#>   [1] readxl_1.3.1         backports_1.1.5      plyr_1.8.6          
#>   [4] igraph_1.2.4.2       splines_3.6.2        svUnit_0.7-12       
#>   [7] crosstalk_1.0.0      TH.data_1.0-10       rstantools_2.0.0    
#>  [10] inline_0.3.15        digest_0.6.25        htmltools_0.4.0     
#>  [13] rsconnect_0.8.16     fansi_0.4.1          magrittr_1.5        
#>  [16] openxlsx_4.1.4       modelr_0.1.6         matrixStats_0.55.0  
#>  [19] sandwich_2.5-1       xts_0.12-0           prettyunits_1.1.1   
#>  [22] colorspace_1.4-1     rvest_0.3.5          haven_2.2.0         
#>  [25] xfun_0.12            jsonlite_1.6.1       callr_3.4.2         
#>  [28] crayon_1.3.4         survival_3.1-8       zoo_1.8-7           
#>  [31] glue_1.3.1           gtable_0.3.0         emmeans_1.4.5       
#>  [34] webshot_0.5.2        sjstats_0.17.9       sjmisc_2.8.3        
#>  [37] pkgbuild_1.0.6       rstan_2.19.3         abind_1.4-5         
#>  [40] scales_1.1.0         mvtnorm_1.1-0        DBI_1.1.0           
#>  [43] miniUI_0.1.1.1       viridisLite_0.3.0    performance_0.4.4   
#>  [46] foreign_0.8-76       stats4_3.6.2         StanHeaders_2.21.0-1
#>  [49] DT_0.12              htmlwidgets_1.5.1    httr_1.4.1          
#>  [52] threejs_0.3.3        arrayhelpers_1.1-0   ellipsis_0.3.0      
#>  [55] farver_2.0.3         pkgconfig_2.0.3      loo_2.2.0           
#>  [58] dbplyr_1.4.2         utf8_1.1.4           labeling_0.3        
#>  [61] tidyselect_1.0.0     rlang_0.4.5          reshape2_1.4.3      
#>  [64] later_1.0.0          effectsize_0.2.0     munsell_0.5.0       
#>  [67] cellranger_1.1.0     tools_3.6.2          cli_2.0.2           
#>  [70] generics_0.0.2       sjlabelled_1.1.3     ggridges_0.5.2      
#>  [73] evaluate_0.14        fastmap_1.0.1        yaml_2.2.1          
#>  [76] fs_1.3.2             processx_3.4.2       zip_2.0.4           
#>  [79] nlme_3.1-145         mime_0.9             xml2_1.2.2          
#>  [82] compiler_3.6.2       bayesplot_1.7.1      shinythemes_1.1.2   
#>  [85] rstudioapi_0.11      curl_4.3             reprex_0.3.0        
#>  [88] stringi_1.4.6        ps_1.3.2             parameters_0.5.0    
#>  [91] Brobdingnag_1.2-6    lattice_0.20-40      nloptr_1.2.1        
#>  [94] markdown_1.1         shinyjs_1.1          vctrs_0.2.3         
#>  [97] pillar_1.4.3         lifecycle_0.1.0      bridgesampling_1.0-0
#> [100] estimability_1.3     data.table_1.12.8    insight_0.8.1       
#> [103] httpuv_1.5.2         R6_2.4.1             bookdown_0.18       
#> [106] promises_1.1.0       gridExtra_2.3        rio_0.5.16          
#> [109] codetools_0.2-16     boot_1.3-24          colourpicker_1.0    
#> [112] MASS_7.3-51.5        gtools_3.8.1         assertthat_0.2.1    
#> [115] withr_2.1.2          shinystan_2.5.0      multcomp_1.4-12     
#> [118] mgcv_1.8-31          bayestestR_0.5.2     parallel_3.6.2      
#> [121] hms_0.5.3            grid_3.6.2           coda_0.19-3         
#> [124] minqa_1.2.4          snakecase_0.11.0     rmarkdown_2.1       
#> [127] lubridate_1.7.4      shiny_1.4.0          base64enc_0.1-3     
#> [130] dygraphs_1.1.1.6

References

Fox, John, and Sanford Weisberg. 2018. An R Companion to Applied Regression. Sage Publications.